Name: Kiran Shrestha
Title: Data for health policy
Link to the page: here

Project Goals¶

The project investigates county-level health data for the USA. The data cover health factors (health behaviors, clinical care, socio-economic factors, and the physical environment) as well as health outcomes. Using this dataset along with additional public datasets, I plan to find out which variables matter most for health outcomes. Metrics such as correlation coefficients can help rank them. Using the selected variables, I plan to build a model and possibly test it on new data sources.

Collaboration Plan¶

I plan to first find more datasets that can be joined to this one, so that more explanatory variables that could influence health outcomes are available. Demographics, education quality, or the presence or absence of certain institutions might shed more light on the health results. GitHub will be the primary place to store all the data and notebooks.

Import libraries¶

In [3]:
pip install missingno 
Collecting missingno
  Downloading missingno-0.5.2-py3-none-any.whl (8.7 kB)
Requirement already satisfied: numpy in /opt/conda/lib/python3.11/site-packages (from missingno) (1.24.4)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.11/site-packages (from missingno) (3.7.2)
Requirement already satisfied: scipy in /opt/conda/lib/python3.11/site-packages (from missingno) (1.11.2)
Requirement already satisfied: seaborn in /opt/conda/lib/python3.11/site-packages (from missingno) (0.12.2)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (4.42.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (10.0.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (2.8.2)
Requirement already satisfied: pandas>=0.25 in /opt/conda/lib/python3.11/site-packages (from seaborn->missingno) (2.0.3)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.11/site-packages (from pandas>=0.25->seaborn->missingno) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.11/site-packages (from pandas>=0.25->seaborn->missingno) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib->missingno) (1.16.0)
Installing collected packages: missingno
Successfully installed missingno-0.5.2
Note: you may need to restart the kernel to use updated packages.
In [2]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# import pycountry_convert as pc 
import missingno as mno
import warnings

Read the dataset¶

In [3]:
URL = "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2023_0.csv"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df = pd.read_csv(URL, storage_options=headers);
In [7]:
# with warnings.catch_warnings():
#     warnings.simplefilter('ignore')
#     df = pd.read_csv("data/analytic_data2023_0.csv")
In [6]:
df.head()
Out[6]:
State FIPS Code County FIPS Code 5-digit FIPS Code State Abbreviation Name Release Year County Ranked (Yes=1/No=0) Premature Death raw value Premature Death numerator Premature Death denominator ... % Female raw value % Female numerator % Female denominator % Female CI low % Female CI high % Rural raw value % Rural numerator % Rural denominator % Rural CI low % Rural CI high
0 statecode countycode fipscode state county year county_ranked v001_rawvalue v001_numerator v001_denominator ... v057_rawvalue v057_numerator v057_denominator v057_cilow v057_cihigh v058_rawvalue v058_numerator v058_denominator v058_cilow v058_cihigh
1 00 000 00000 US United States 2023 NaN 7281.9355638 4125218 917267406 ... 0.5047067187 167509003 331893745 NaN NaN 0.193 NaN NaN NaN NaN
2 01 000 01000 AL Alabama 2023 NaN 10350.071456 88086 13668498 ... 0.5142542169 2591778 5039877 NaN NaN 0.409631829 1957932 4779736 NaN NaN
3 01 001 01001 AL Autauga County 2023 1 8027.3947267 836 156081 ... 0.513782892 30362 59095 NaN NaN 0.4200216232 22921 54571 NaN NaN
4 01 003 01003 AL Baldwin County 2023 1 8118.3582061 3377 614143 ... 0.5134771453 122872 239294 NaN NaN 0.4227909911 77060 182265 NaN NaN

5 rows × 720 columns

In [7]:
df.shape
Out[7]:
(3195, 720)
In [5]:
# Display the first 10 columns
for col in df.columns[:10]:
    print(col)
State FIPS Code
County FIPS Code
5-digit FIPS Code
State Abbreviation
Name
Release Year
County Ranked (Yes=1/No=0)
Premature Death raw value
Premature Death numerator
Premature Death denominator

About dataset¶

health_graph.png

This page describes the idea behind the dataset. This link has the datasets from different years available for download. This page lists the underlying sources from which the data were compiled. The dataset has 700+ features to work with, although many columns overlap in meaning and some have missing data.

Primarily, the data columns can be divided into health factors and health outcomes.

Data Cleaning¶

In [9]:
# This plot shows how complete each column is
# The longer the bar, the fewer values are missing
mno.bar(df)
Out[9]:
<Axes: >

Drop the columns with more than 1000 missing values¶

In [10]:
# Keep only the columns that have at most 1000 missing values
df = df.loc[:, df.isnull().sum() <= 1000]
In [11]:
# Number of columns reduced from 720 to 326
df.shape
Out[11]:
(3195, 326)
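The same filter can be wrapped in a small helper so the cutoff is easy to vary. This is only a sketch: `drop_sparse_columns` and `max_missing` are hypothetical names, and the toy frame stands in for the real table.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, max_missing: int) -> pd.DataFrame:
    """Return a copy of `frame` without columns that have more than `max_missing` NaNs."""
    missing_per_column = frame.isnull().sum()
    keep = missing_per_column[missing_per_column <= max_missing].index
    return frame.loc[:, keep].copy()


# Toy data standing in for the real table (values are made up).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
trimmed = drop_sparse_columns(toy, max_missing=1)
print(list(trimmed.columns))
```

Because the helper returns a new frame, the original stays available for comparing different cutoffs.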
In [9]:
mno.bar(df)
Out[9]:
<Axes: >

Extract the necessary columns¶

Many columns carry overlapping information: each raw value is accompanied by numerator, denominator, and confidence-interval columns. So we keep only the columns that are enough to represent each measurement.

In [13]:
# We need the raw values only
new_cols = [x for x in df.columns if "raw value" in x]
new_cols = list(df.columns[0:5]) + new_cols
In [14]:
# Replace % with the word "percent"
cols = list(map(lambda x:x.replace("%", "percent"), new_cols))
# Turn hyphens into spaces and drop the " raw value" suffix
cols = list(map(lambda x:x.replace("-", " "), cols))
cols = list(map(lambda x:x.replace(" raw value", ""), cols))
# Turn every remaining space into an underscore
cols = list(map(lambda x:x.replace(" ", "_"), cols))
cols
Out[14]:
['State_FIPS_Code',
 'County_FIPS_Code',
 '5_digit_FIPS_Code',
 'State_Abbreviation',
 'Name',
 'Premature_Death',
 'Poor_or_Fair_Health',
 'Poor_Physical_Health_Days',
 'Poor_Mental_Health_Days',
 'Low_Birthweight',
 'Adult_Smoking',
 'Adult_Obesity',
 'Food_Environment_Index',
 'Physical_Inactivity',
 'Access_to_Exercise_Opportunities',
 'Excessive_Drinking',
 'Alcohol_Impaired_Driving_Deaths',
 'Sexually_Transmitted_Infections',
 'Teen_Births',
 'Uninsured',
 'Primary_Care_Physicians',
 'Dentists',
 'Mental_Health_Providers',
 'Preventable_Hospital_Stays',
 'Mammography_Screening',
 'Flu_Vaccinations',
 'High_School_Completion',
 'Some_College',
 'Unemployment',
 'Children_in_Poverty',
 'Income_Inequality',
 'Children_in_Single_Parent_Households',
 'Social_Associations',
 'Injury_Deaths',
 'Air_Pollution___Particulate_Matter',
 'Drinking_Water_Violations',
 'Severe_Housing_Problems',
 'Driving_Alone_to_Work',
 'Long_Commute___Driving_Alone',
 'Life_Expectancy',
 'Premature_Age_Adjusted_Mortality',
 'Frequent_Physical_Distress',
 'Frequent_Mental_Distress',
 'Diabetes_Prevalence',
 'HIV_Prevalence',
 'Food_Insecurity',
 'Limited_Access_to_Healthy_Foods',
 'Insufficient_Sleep',
 'Uninsured_Adults',
 'Uninsured_Children',
 'Other_Primary_Care_Providers',
 'High_School_Graduation',
 'Reading_Scores',
 'Math_Scores',
 'School_Segregation',
 'School_Funding_Adequacy',
 'Gender_Pay_Gap',
 'Median_Household_Income',
 'Children_Eligible_for_Free_or_Reduced_Price_Lunch',
 'Child_Care_Cost_Burden',
 'Child_Care_Centers',
 'Suicides',
 'Firearm_Fatalities',
 'Motor_Vehicle_Crash_Deaths',
 'Voter_Turnout',
 'Census_Participation',
 'Traffic_Volume',
 'Homeownership',
 'Severe_Housing_Cost_Burden',
 'Broadband_Access',
 'Population',
 'percent_Below_18_Years_of_Age',
 'percent_65_and_Older',
 'percent_Non_Hispanic_Black',
 'percent_American_Indian_or_Alaska_Native',
 'percent_Asian',
 'percent_Native_Hawaiian_or_Other_Pacific_Islander',
 'percent_Hispanic',
 'percent_Non_Hispanic_White',
 'percent_Not_Proficient_in_English',
 'percent_Female',
 'percent_Rural']
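Names such as Air_Pollution___Particulate_Matter end up with three underscores because " - " becomes three spaces before the spaces are replaced. A regex-based variant that collapses such runs is sketched below. It is an alternative, not what the notebook ran: `clean_column_name` is a hypothetical helper, the raw headers in `examples` are inferred from the cleaned names above, and the resulting names differ from the ones used in the rest of this notebook.

```python
import re


def clean_column_name(name: str) -> str:
    """Turn a raw header such as '% Rural raw value' into 'percent_Rural'."""
    name = name.replace("%", "percent")
    name = name.replace(" raw value", "")
    # Collapse every run of spaces and hyphens into a single underscore.
    name = re.sub(r"[\s\-]+", "_", name)
    return name.strip("_")


# Raw headers assumed for illustration.
examples = [
    "Air Pollution - Particulate Matter raw value",
    "% Non-Hispanic Black raw value",
    "5-digit FIPS Code",
]
print([clean_column_name(x) for x in examples])
```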
In [15]:
# Slice the dataframe
df = df[new_cols]
# Rename the columns
df = df.rename(columns=dict(zip(new_cols, cols)))
In [16]:
df.head(2)
Out[16]:
State_FIPS_Code County_FIPS_Code 5_digit_FIPS_Code State_Abbreviation Name Premature_Death Poor_or_Fair_Health Poor_Physical_Health_Days Poor_Mental_Health_Days Low_Birthweight ... percent_65_and_Older percent_Non_Hispanic_Black percent_American_Indian_or_Alaska_Native percent_Asian percent_Native_Hawaiian_or_Other_Pacific_Islander percent_Hispanic percent_Non_Hispanic_White percent_Not_Proficient_in_English percent_Female percent_Rural
0 statecode countycode fipscode state county v001_rawvalue v002_rawvalue v036_rawvalue v042_rawvalue v037_rawvalue ... v053_rawvalue v054_rawvalue v055_rawvalue v081_rawvalue v080_rawvalue v056_rawvalue v126_rawvalue v059_rawvalue v057_rawvalue v058_rawvalue
1 00 000 00000 US United States 7281.9355638 0.12 3 4.4 0.0819065527 ... 0.1682705801 0.1261202919 0.0131594526 0.0613162595 0.0026003593 0.1887563262 0.5930615866 0.0410440385 0.5047067187 0.193

2 rows × 82 columns

In [17]:
# Remove the first row, which holds variable codes (e.g. v001_rawvalue) rather than data
df = df.drop([0])
df = df.reset_index(drop=True)
df.head(2)
Out[17]:
State_FIPS_Code County_FIPS_Code 5_digit_FIPS_Code State_Abbreviation Name Premature_Death Poor_or_Fair_Health Poor_Physical_Health_Days Poor_Mental_Health_Days Low_Birthweight ... percent_65_and_Older percent_Non_Hispanic_Black percent_American_Indian_or_Alaska_Native percent_Asian percent_Native_Hawaiian_or_Other_Pacific_Islander percent_Hispanic percent_Non_Hispanic_White percent_Not_Proficient_in_English percent_Female percent_Rural
0 00 000 00000 US United States 7281.9355638 0.12 3 4.4 0.0819065527 ... 0.1682705801 0.1261202919 0.0131594526 0.0613162595 0.0026003593 0.1887563262 0.5930615866 0.0410440385 0.5047067187 0.193
1 01 000 01000 AL Alabama 10350.071456 0.189 3.4824161407 5.0732772786 0.1043276003 ... 0.1763568833 0.2651199623 0.0071444204 0.0155043466 0.0010883202 0.0478519615 0.6487709918 0.0102759588 0.5142542169 0.409631829

2 rows × 82 columns
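The variable-code row could also be skipped while reading the file, which lets pandas infer numeric types straight away. The sketch below uses a made-up miniature of the file layout rather than the real download. `skiprows` and `dtype` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Made-up miniature of the file layout: a label row, a variable-code row, then data.
raw_csv = io.StringIO(
    "State FIPS Code,Name,Premature Death raw value\n"
    "statecode,county,v001_rawvalue\n"
    "00,United States,7281.9\n"
    "01,Alabama,10350.1\n"
)

# skiprows=[1] skips the second physical line (the variable codes);
# dtype keeps the FIPS code as text so its leading zero survives.
toy = pd.read_csv(raw_csv, skiprows=[1], dtype={"State FIPS Code": str})
print(toy.dtypes)
```

With this approach the later drop of row 0 and most of the type conversion would not be needed, though the cells in this notebook keep the original two-step route.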

In [18]:
# Checking the states
df["State_Abbreviation"].unique()
Out[18]:
array(['US', 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL',
       'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
       'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM',
       'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
       'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
In [19]:
df[df["State_Abbreviation"] =="WY"].head(3)
Out[19]:
State_FIPS_Code County_FIPS_Code 5_digit_FIPS_Code State_Abbreviation Name Premature_Death Poor_or_Fair_Health Poor_Physical_Health_Days Poor_Mental_Health_Days Low_Birthweight ... percent_65_and_Older percent_Non_Hispanic_Black percent_American_Indian_or_Alaska_Native percent_Asian percent_Native_Hawaiian_or_Other_Pacific_Islander percent_Hispanic percent_Non_Hispanic_White percent_Not_Proficient_in_English percent_Female percent_Rural
3170 56 0 56000 WY Wyoming 7809.903503 0.115 2.698914 4.130766 0.090792 ... 0.179469 0.010394 0.028395 0.010935 0.001012 0.10554 0.833306 0.006424 0.48823 0.35242
3171 56 1 56001 WY Albany County 5133.53187 0.11 2.90064 4.179786 0.085394 ... 0.129866 0.012949 0.013162 0.034567 0.001409 0.101627 0.821581 0.006262 0.47817 0.119397
3172 56 3 56003 WY Big Horn County 9097.45733 0.123 2.998264 3.865339 0.069968 ... 0.217675 0.007479 0.018054 0.005416 0.000516 0.096114 0.867435 0.015205 0.491145 1.0

3 rows × 82 columns

The row where State_Abbreviation is US holds the national values. Within each state, the row whose Name is the state itself holds the state-level values; every other row is a county.

County_FIPS_Code is 0 for these national and state-level rows.
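Since national, state, and county rows sit in one table, county-level analysis needs the summary rows separated out first. A minimal sketch on made-up rows follows. The variable names `national`, `states`, and `counties` are illustrative.

```python
import pandas as pd

# Made-up rows mirroring the three levels present in the table.
toy = pd.DataFrame({
    "State_Abbreviation": ["US", "AL", "AL", "AL"],
    "County_FIPS_Code": ["000", "000", "001", "003"],
    "Name": ["United States", "Alabama", "Autauga County", "Baldwin County"],
})

# astype(int) makes the comparison work whether the code is stored as '000' or 0.
is_summary_row = toy["County_FIPS_Code"].astype(int) == 0

national = toy[toy["State_Abbreviation"] == "US"]
states = toy[is_summary_row & (toy["State_Abbreviation"] != "US")]
counties = toy[~is_summary_row].reset_index(drop=True)
print(counties["Name"].tolist())
```

Keeping `national` and `states` as separate frames leaves them available as benchmarks for the county values.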

Correct the data types¶

In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3194 entries, 0 to 3193
Data columns (total 82 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   State_FIPS_Code                                    3194 non-null   object
 1   County_FIPS_Code                                   3194 non-null   object
 2   5_digit_FIPS_Code                                  3194 non-null   object
 3   State_Abbreviation                                 3194 non-null   object
 4   Name                                               3194 non-null   object
 5   Premature_Death                                    3134 non-null   object
 6   Poor_or_Fair_Health                                3192 non-null   object
 7   Poor_Physical_Health_Days                          3192 non-null   object
 8   Poor_Mental_Health_Days                            3192 non-null   object
 9   Low_Birthweight                                    3088 non-null   object
 10  Adult_Smoking                                      3192 non-null   object
 11  Adult_Obesity                                      3192 non-null   object
 12  Food_Environment_Index                             3161 non-null   object
 13  Physical_Inactivity                                3192 non-null   object
 14  Access_to_Exercise_Opportunities                   3132 non-null   object
 15  Excessive_Drinking                                 3192 non-null   object
 16  Alcohol_Impaired_Driving_Deaths                    3167 non-null   object
 17  Sexually_Transmitted_Infections                    3071 non-null   object
 18  Teen_Births                                        3005 non-null   object
 19  Uninsured                                          3193 non-null   object
 20  Primary_Care_Physicians                            3047 non-null   object
 21  Dentists                                           3108 non-null   object
 22  Mental_Health_Providers                            2993 non-null   object
 23  Preventable_Hospital_Stays                         3123 non-null   object
 24  Mammography_Screening                              3173 non-null   object
 25  Flu_Vaccinations                                   3176 non-null   object
 26  High_School_Completion                             3194 non-null   object
 27  Some_College                                       3194 non-null   object
 28  Unemployment                                       3193 non-null   object
 29  Children_in_Poverty                                3193 non-null   object
 30  Income_Inequality                                  3187 non-null   object
 31  Children_in_Single_Parent_Households               3193 non-null   object
 32  Social_Associations                                3194 non-null   object
 33  Injury_Deaths                                      3089 non-null   object
 34  Air_Pollution___Particulate_Matter                 3167 non-null   object
 35  Drinking_Water_Violations                          3149 non-null   object
 36  Severe_Housing_Problems                            3194 non-null   object
 37  Driving_Alone_to_Work                              3194 non-null   object
 38  Long_Commute___Driving_Alone                       3194 non-null   object
 39  Life_Expectancy                                    3124 non-null   object
 40  Premature_Age_Adjusted_Mortality                   3134 non-null   object
 41  Frequent_Physical_Distress                         3192 non-null   object
 42  Frequent_Mental_Distress                           3192 non-null   object
 43  Diabetes_Prevalence                                3192 non-null   object
 44  HIV_Prevalence                                     2735 non-null   object
 45  Food_Insecurity                                    3194 non-null   object
 46  Limited_Access_to_Healthy_Foods                    3161 non-null   object
 47  Insufficient_Sleep                                 3192 non-null   object
 48  Uninsured_Adults                                   3193 non-null   object
 49  Uninsured_Children                                 3193 non-null   object
 50  Other_Primary_Care_Providers                       3183 non-null   object
 51  High_School_Graduation                             2362 non-null   object
 52  Reading_Scores                                     2826 non-null   object
 53  Math_Scores                                        2739 non-null   object
 54  School_Segregation                                 2962 non-null   object
 55  School_Funding_Adequacy                            3133 non-null   object
 56  Gender_Pay_Gap                                     3187 non-null   object
 57  Median_Household_Income                            3192 non-null   object
 58  Children_Eligible_for_Free_or_Reduced_Price_Lunch  2606 non-null   object
 59  Child_Care_Cost_Burden                             3192 non-null   object
 60  Child_Care_Centers                                 3044 non-null   object
 61  Suicides                                           2485 non-null   object
 62  Firearm_Fatalities                                 2323 non-null   object
 63  Motor_Vehicle_Crash_Deaths                         2743 non-null   object
 64  Voter_Turnout                                      3164 non-null   object
 65  Census_Participation                               3142 non-null   object
 66  Traffic_Volume                                     3041 non-null   object
 67  Homeownership                                      3194 non-null   object
 68  Severe_Housing_Cost_Burden                         3189 non-null   object
 69  Broadband_Access                                   3194 non-null   object
 70  Population                                         3194 non-null   object
 71  percent_Below_18_Years_of_Age                      3194 non-null   object
 72  percent_65_and_Older                               3194 non-null   object
 73  percent_Non_Hispanic_Black                         3194 non-null   object
 74  percent_American_Indian_or_Alaska_Native           3194 non-null   object
 75  percent_Asian                                      3194 non-null   object
 76  percent_Native_Hawaiian_or_Other_Pacific_Islander  3194 non-null   object
 77  percent_Hispanic                                   3194 non-null   object
 78  percent_Non_Hispanic_White                         3194 non-null   object
 79  percent_Not_Proficient_in_English                  3194 non-null   object
 80  percent_Female                                     3194 non-null   object
 81  percent_Rural                                      3187 non-null   object
dtypes: object(82)
memory usage: 2.0+ MB
In [22]:
print(df.head(2).T.to_string())
                                                               0             1
State_FIPS_Code                                               00            01
County_FIPS_Code                                             000           000
5_digit_FIPS_Code                                          00000         01000
State_Abbreviation                                            US            AL
Name                                               United States       Alabama
Premature_Death                                     7281.9355638  10350.071456
Poor_or_Fair_Health                                         0.12         0.189
Poor_Physical_Health_Days                                      3  3.4824161407
Poor_Mental_Health_Days                                      4.4  5.0732772786
Low_Birthweight                                     0.0819065527  0.1043276003
Adult_Smoking                                               0.16         0.195
Adult_Obesity                                               0.32         0.393
Food_Environment_Index                                         7           5.3
Physical_Inactivity                                         0.22         0.278
Access_to_Exercise_Opportunities                    0.8423863046  0.6092667226
Excessive_Drinking                                          0.19  0.1614162693
Alcohol_Impaired_Driving_Deaths                     0.2655507901   0.258869637
Sexually_Transmitted_Infections                            481.3         552.2
Teen_Births                                         19.300572586  27.598889304
Uninsured                                           0.1044496729  0.1182271569
Primary_Care_Physicians                             0.0007637606  0.0006579252
Dentists                                            0.0007246807  0.0004869166
Mental_Health_Providers                             0.0029570126  0.0012541973
Preventable_Hospital_Stays                                  2809          3599
Mammography_Screening                                       0.37          0.36
Flu_Vaccinations                                            0.51          0.44
High_School_Completion                              0.8887404032  0.8740270016
Some_College                                        0.6725325979  0.6150082742
Unemployment                                        0.0535291312  0.0343902829
Children_in_Poverty                                        0.169         0.227
Income_Inequality                                   4.8913749294  5.1766763312
Children_in_Single_Parent_Households                0.2512967212  0.3090921916
Social_Associations                                 9.1296963648  11.910925297
Injury_Deaths                                       75.899512272    86.9057184
Air_Pollution___Particulate_Matter                           7.4           9.3
Drinking_Water_Violations                                    NaN  0.1343283582
Severe_Housing_Problems                             0.1696721824  0.1315678879
Driving_Alone_to_Work                                0.732358592  0.8378249329
Long_Commute___Driving_Alone                               0.365          0.35
Life_Expectancy                                     78.528894654   74.83594896
Premature_Age_Adjusted_Mortality                     358.7460227  499.86855039
Frequent_Physical_Distress                                  0.09  0.1107739678
Frequent_Mental_Distress                                    0.14  0.1648429623
Diabetes_Prevalence                                         0.09          0.13
HIV_Prevalence                                             379.7         341.6
Food_Insecurity                                            0.118         0.145
Limited_Access_to_Healthy_Foods                     0.0610019647  0.0876054853
Insufficient_Sleep                                          0.33  0.3924300962
Uninsured_Adults                                     0.123766561  0.1491000099
Uninsured_Children                                  0.0539542665  0.0362680404
Other_Primary_Care_Providers                        0.0012318702  0.0010861376
High_School_Graduation                                      0.87  0.9071081634
Reading_Scores                                            3.0534   2.885602535
Math_Scores                                                3.003    2.72218766
School_Segregation                                        0.2454  0.2817412656
School_Funding_Adequacy                                     1062     -3868.511
Gender_Pay_Gap                                      0.8100444614  0.7418970988
Median_Household_Income                                    69717         53990
Children_Eligible_for_Free_or_Reduced_Price_Lunch   0.5308547682    0.53338294
Child_Care_Cost_Burden                              0.2659357065  0.2722218184
Child_Care_Centers                                  6.8638668282  5.5092316855
Suicides                                            13.818282988  16.200669652
Firearm_Fatalities                                  12.430330228  22.293899524
Motor_Vehicle_Crash_Deaths                          11.591311264  20.205514853
Voter_Turnout                                       0.6790952146  0.6263600041
Census_Participation                                       0.652           NaN
Traffic_Volume                                            505.31  213.69282656
Homeownership                                        0.646331101  0.6939478703
Severe_Housing_Cost_Burden                          0.1427574897  0.1194424811
Broadband_Access                                    0.8700069587  0.8204571454
Population                                             331893745       5039877
percent_Below_18_Years_of_Age                       0.2216565817  0.2226744819
percent_65_and_Older                                0.1682705801  0.1763568833
percent_Non_Hispanic_Black                          0.1261202919  0.2651199623
percent_American_Indian_or_Alaska_Native            0.0131594526  0.0071444204
percent_Asian                                       0.0613162595  0.0155043466
percent_Native_Hawaiian_or_Other_Pacific_Islander   0.0026003593  0.0010883202
percent_Hispanic                                    0.1887563262  0.0478519615
percent_Non_Hispanic_White                          0.5930615866  0.6487709918
percent_Not_Proficient_in_English                   0.0410440385  0.0102759588
percent_Female                                      0.5047067187  0.5142542169
percent_Rural                                              0.193   0.409631829

Every column except State_Abbreviation and Name can be converted to a numeric type.

In [21]:
# Missing values are already NaN, so this call leaves the data unchanged
df.fillna(np.nan, inplace =True)
In [22]:
# Columns to convert to numeric: everything except State_Abbreviation and Name
to_float= [col for col in list(df.columns) if col not in list(df.columns[3:5])]
df[to_float] = df[to_float].apply(pd.to_numeric)
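Converting the FIPS code columns to integers drops their leading zeros ("01001" becomes 1001), which can matter when joining to other datasets keyed on 5-digit FIPS strings. One possible alternative, sketched on made-up rows, converts only the measurement columns and keeps the codes as text. `id_cols`, `value_cols`, and the `fips` column are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "State_FIPS_Code": ["01", "01"],
    "County_FIPS_Code": ["001", "003"],
    "Name": ["Autauga County", "Baldwin County"],
    "Premature_Death": ["8027.3947267", "8118.3582061"],
})

id_cols = ["State_FIPS_Code", "County_FIPS_Code", "Name"]
value_cols = [c for c in toy.columns if c not in id_cols]

# Convert only measurement columns; errors="coerce" turns unparseable text into NaN.
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors="coerce")

# The codes stay as text, so a 5-digit key can be rebuilt without losing zeros.
toy["fips"] = toy["State_FIPS_Code"].str.zfill(2) + toy["County_FIPS_Code"].str.zfill(3)
print(toy["fips"].tolist())
```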
In [23]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3194 entries, 0 to 3193
Data columns (total 82 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   State_FIPS_Code                                    3194 non-null   int64  
 1   County_FIPS_Code                                   3194 non-null   int64  
 2   5_digit_FIPS_Code                                  3194 non-null   int64  
 3   State_Abbreviation                                 3194 non-null   object 
 4   Name                                               3194 non-null   object 
 5   Premature_Death                                    3134 non-null   float64
 6   Poor_or_Fair_Health                                3192 non-null   float64
 7   Poor_Physical_Health_Days                          3192 non-null   float64
 8   Poor_Mental_Health_Days                            3192 non-null   float64
 9   Low_Birthweight                                    3088 non-null   float64
 10  Adult_Smoking                                      3192 non-null   float64
 11  Adult_Obesity                                      3192 non-null   float64
 12  Food_Environment_Index                             3161 non-null   float64
 13  Physical_Inactivity                                3192 non-null   float64
 14  Access_to_Exercise_Opportunities                   3132 non-null   float64
 15  Excessive_Drinking                                 3192 non-null   float64
 16  Alcohol_Impaired_Driving_Deaths                    3167 non-null   float64
 17  Sexually_Transmitted_Infections                    3071 non-null   float64
 18  Teen_Births                                        3005 non-null   float64
 19  Uninsured                                          3193 non-null   float64
 20  Primary_Care_Physicians                            3047 non-null   float64
 21  Dentists                                           3108 non-null   float64
 22  Mental_Health_Providers                            2993 non-null   float64
 23  Preventable_Hospital_Stays                         3123 non-null   float64
 24  Mammography_Screening                              3173 non-null   float64
 25  Flu_Vaccinations                                   3176 non-null   float64
 26  High_School_Completion                             3194 non-null   float64
 27  Some_College                                       3194 non-null   float64
 28  Unemployment                                       3193 non-null   float64
 29  Children_in_Poverty                                3193 non-null   float64
 30  Income_Inequality                                  3187 non-null   float64
 31  Children_in_Single_Parent_Households               3193 non-null   float64
 32  Social_Associations                                3194 non-null   float64
 33  Injury_Deaths                                      3089 non-null   float64
 34  Air_Pollution___Particulate_Matter                 3167 non-null   float64
 35  Drinking_Water_Violations                          3149 non-null   float64
 36  Severe_Housing_Problems                            3194 non-null   float64
 37  Driving_Alone_to_Work                              3194 non-null   float64
 38  Long_Commute___Driving_Alone                       3194 non-null   float64
 39  Life_Expectancy                                    3124 non-null   float64
 40  Premature_Age_Adjusted_Mortality                   3134 non-null   float64
 41  Frequent_Physical_Distress                         3192 non-null   float64
 42  Frequent_Mental_Distress                           3192 non-null   float64
 43  Diabetes_Prevalence                                3192 non-null   float64
 44  HIV_Prevalence                                     2735 non-null   float64
 45  Food_Insecurity                                    3194 non-null   float64
 46  Limited_Access_to_Healthy_Foods                    3161 non-null   float64
 47  Insufficient_Sleep                                 3192 non-null   float64
 48  Uninsured_Adults                                   3193 non-null   float64
 49  Uninsured_Children                                 3193 non-null   float64
 50  Other_Primary_Care_Providers                       3183 non-null   float64
 51  High_School_Graduation                             2362 non-null   float64
 52  Reading_Scores                                     2826 non-null   float64
 53  Math_Scores                                        2739 non-null   float64
 54  School_Segregation                                 2962 non-null   float64
 55  School_Funding_Adequacy                            3133 non-null   float64
 56  Gender_Pay_Gap                                     3187 non-null   float64
 57  Median_Household_Income                            3192 non-null   float64
 58  Children_Eligible_for_Free_or_Reduced_Price_Lunch  2606 non-null   float64
 59  Child_Care_Cost_Burden                             3192 non-null   float64
 60  Child_Care_Centers                                 3044 non-null   float64
 61  Suicides                                           2485 non-null   float64
 62  Firearm_Fatalities                                 2323 non-null   float64
 63  Motor_Vehicle_Crash_Deaths                         2743 non-null   float64
 64  Voter_Turnout                                      3164 non-null   float64
 65  Census_Participation                               3142 non-null   float64
 66  Traffic_Volume                                     3041 non-null   float64
 67  Homeownership                                      3194 non-null   float64
 68  Severe_Housing_Cost_Burden                         3189 non-null   float64
 69  Broadband_Access                                   3194 non-null   float64
 70  Population                                         3194 non-null   int64  
 71  percent_Below_18_Years_of_Age                      3194 non-null   float64
 72  percent_65_and_Older                               3194 non-null   float64
 73  percent_Non_Hispanic_Black                         3194 non-null   float64
 74  percent_American_Indian_or_Alaska_Native           3194 non-null   float64
 75  percent_Asian                                      3194 non-null   float64
 76  percent_Native_Hawaiian_or_Other_Pacific_Islander  3194 non-null   float64
 77  percent_Hispanic                                   3194 non-null   float64
 78  percent_Non_Hispanic_White                         3194 non-null   float64
 79  percent_Not_Proficient_in_English                  3194 non-null   float64
 80  percent_Female                                     3194 non-null   float64
 81  percent_Rural                                      3187 non-null   float64
dtypes: float64(76), int64(4), object(2)
memory usage: 2.0+ MB
In [24]:
df.describe()
Out[24]:
State_FIPS_Code County_FIPS_Code 5_digit_FIPS_Code Premature_Death Poor_or_Fair_Health Poor_Physical_Health_Days Poor_Mental_Health_Days Low_Birthweight Adult_Smoking Adult_Obesity ... percent_65_and_Older percent_Non_Hispanic_Black percent_American_Indian_or_Alaska_Native percent_Asian percent_Native_Hawaiian_or_Other_Pacific_Islander percent_Hispanic percent_Non_Hispanic_White percent_Not_Proficient_in_English percent_Female percent_Rural
count 3194.000000 3194.000000 3194.000000 3134.000000 3192.000000 3192.000000 3192.000000 3088.000000 3192.000000 3192.000000 ... 3194.000000 3194.000000 3194.000000 3194.000000 3194.000000 3194.000000 3194.000000 3194.000000 3194.000000 3187.000000
mean 30.249530 101.886662 30351.417032 8891.562734 0.159942 3.511726 4.794971 0.082138 0.199762 0.361428 ... 0.199929 0.090869 0.024611 0.016971 0.001625 0.102183 0.749862 0.016072 0.495715 0.580467
std 15.160981 107.624838 15179.045587 2929.948857 0.044333 0.652486 0.628114 0.020293 0.041210 0.046825 ... 0.047879 0.141564 0.077649 0.030939 0.009667 0.139670 0.202763 0.026852 0.023189 0.315553
min 0.000000 0.000000 0.000000 3090.426825 0.065000 1.849017 2.779181 0.028871 0.067000 0.176000 ... 0.050729 0.000000 0.000000 0.000000 0.000000 0.006827 0.026802 0.000000 0.245614 0.000000
25% 18.000000 33.000000 18171.500000 6868.647904 0.125000 3.027309 4.373272 0.068281 0.174000 0.336000 ... 0.169189 0.008182 0.004311 0.005208 0.000377 0.026999 0.630136 0.002579 0.490583 0.325275
50% 29.000000 77.000000 29174.000000 8538.518058 0.152000 3.448386 4.813037 0.079532 0.198000 0.366000 ... 0.195519 0.024266 0.007193 0.008099 0.000721 0.048823 0.821402 0.007069 0.499580 0.588250
75% 45.000000 133.000000 45074.500000 10494.403953 0.189000 3.946273 5.221064 0.091418 0.226000 0.391000 ... 0.225286 0.104902 0.014716 0.015733 0.001357 0.108941 0.915241 0.017709 0.507127 0.861214
max 56.000000 840.000000 56045.000000 30007.870277 0.368000 6.335031 6.945581 0.216981 0.411000 0.532000 ... 0.581710 0.856197 0.922567 0.420553 0.475610 0.962604 0.975921 0.384369 0.570535 1.000000

8 rows × 80 columns

Plotting the data¶

Relationship between sleep and obesity in Louisiana (LA) and California (CA)

In [25]:
x = "Adult_Obesity"
y = "Insufficient_Sleep"
z = "State_Abbreviation"
# Keep only the rows where all three columns are present
not_null_mask = df[[x, y, z]].notnull().all(axis=1)
not_null_rows = df[[x, y, z]][not_null_mask]

# Restrict the plot data to Louisiana (LA) and California (CA)
not_null_rows = not_null_rows[not_null_rows[z].isin(["LA", "CA"])]
In [26]:
sns.scatterplot(data=not_null_rows, x = x, y = y, hue = z)
Out[26]:
<Axes: xlabel='Adult_Obesity', ylabel='Insufficient_Sleep'>

Checking correlations between a few columns¶

In [27]:
sns.scatterplot(data=df, x = "Broadband_Access", y = "Math_Scores")
Out[27]:
<Axes: xlabel='Broadband_Access', ylabel='Math_Scores'>
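
The scatter plot only gives a visual impression, so it may help to put a number next to it. The following is a minimal sketch, not part of the original analysis: `pair_correlation` is a hypothetical helper, and the `toy` frame holds made-up values standing in for `df`. It drops rows where either column is missing and reports how many rows were actually used together with the Pearson coefficient.

```python
import pandas as pd


def pair_correlation(frame, x, y):
    """Return (rows_used, pearson_r) for two columns, ignoring missing rows."""
    pair = frame[[x, y]].dropna()
    r = pair[x].corr(pair[y], method="pearson")
    return len(pair), r


# Made-up stand-in data; in the notebook this would be df itself.
toy = pd.DataFrame({
    "Broadband_Access": [0.60, 0.70, 0.80, 0.90, None],
    "Math_Scores": [2.6, 2.8, 3.0, 3.2, 3.1],
})

n, r = pair_correlation(toy, "Broadband_Access", "Math_Scores")
print(f"rows used: {n}, Pearson r: {r:.3f}")
```

Reporting the row count matters here because columns such as Math_Scores have several hundred missing values, so each pair of columns is compared on a different subset of counties.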

Splitting the columns into health factors (variables) and health outcomes (targets)

In [28]:
target_cols = ['Premature_Death', 'Life_Expectancy', 'Premature_Age_Adjusted_Mortality', 
'Poor_or_Fair_Health','Poor_Physical_Health_Days', 'Poor_Mental_Health_Days','Low_Birthweight', 
'Frequent_Physical_Distress','Frequent_Mental_Distress', 'Diabetes_Prevalence', 'HIV_Prevalence']
In [29]:
variable_cols = [x for x in df.columns[5:] if x not in target_cols]
In [30]:
df_corr = df.iloc[:,5:].corr()
df_corr.shape
Out[30]:
(77, 77)

Correlation measurements¶

In [31]:
df_corr = df_corr[variable_cols]
df_corr = df_corr.loc[target_cols]
df_corr.shape
Out[31]:
(11, 66)
In [36]:
# Set the figure size first: rc changes only apply to figures created afterwards
sns.set(rc={'figure.figsize':(10,15)})
sns.heatmap(df_corr.T, annot = True, annot_kws={"fontsize":7})
plt.xticks(fontsize=8)
plt.yticks(fontsize=9)
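
A heatmap with 66 rows and 11 columns is hard to read cell by cell. As a rough complement, the sketch below picks, for each target, the factor with the largest absolute correlation. It assumes a frame laid out like `df_corr` (targets as rows, factors as columns); `strongest_factor_per_target` is a hypothetical helper and the numbers in `toy_corr` are invented for illustration only.

```python
import pandas as pd


def strongest_factor_per_target(corr):
    """corr: rows are targets, columns are factors (same layout as df_corr)."""
    # Column label of the largest |r| in each row
    strongest = corr.abs().idxmax(axis=1)
    # Look up the signed correlation for each (target, factor) pair
    values = [corr.loc[target, factor] for target, factor in strongest.items()]
    return pd.DataFrame({"factor": strongest, "r": values})


# Invented correlation values, only to show the shape of the result.
toy_corr = pd.DataFrame(
    {"Adult_Smoking": [-0.70, 0.60], "Median_Household_Income": [0.65, -0.80]},
    index=["Life_Expectancy", "Poor_or_Fair_Health"],
)

summary = strongest_factor_per_target(toy_corr)
print(summary)
```

In the notebook, calling the helper on `df_corr` would give one row per target, which is easier to scan than the annotated heatmap.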

Finding the features most strongly correlated with obesity (the five most negative and the six most positive)

In [33]:
obesity_corr = list(df.iloc[:, 5:].corr()[["Adult_Obesity"]].sort_values(by = "Adult_Obesity").index)
obesity_corr = obesity_corr[:5] + obesity_corr[-7:-1]
obesity_corr
Out[33]:
['Life_Expectancy',
 'Median_Household_Income',
 'Some_College',
 'Voter_Turnout',
 'Broadband_Access',
 'Premature_Age_Adjusted_Mortality',
 'Frequent_Physical_Distress',
 'Adult_Smoking',
 'Poor_or_Fair_Health',
 'Diabetes_Prevalence',
 'Physical_Inactivity']
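
The list above drops the correlation values and relies on slice positions to exclude Adult_Obesity itself. A small variation, sketched below under the assumption that the same column names are used, keeps the coefficients and removes the self-correlation by name. `top_correlates` is a hypothetical helper and the `toy` frame contains made-up values.

```python
import pandas as pd


def top_correlates(frame, target, k=5):
    """Return the k most negative and k most positive correlates of target."""
    corr = frame.select_dtypes(include="number").corr()[target]
    # Remove the trivial self-correlation by label instead of by position
    ordered = corr.drop(target).sort_values()
    most_negative = ordered.head(k)
    most_positive = ordered.tail(k)[::-1]
    return most_negative, most_positive


# Made-up stand-in data; in the notebook this would be df.iloc[:, 5:].
toy = pd.DataFrame({
    "State_Abbreviation": ["AA", "BB", "CC", "DD"],  # non-numeric, skipped
    "Adult_Obesity": [0.30, 0.35, 0.40, 0.45],
    "Physical_Inactivity": [0.20, 0.25, 0.30, 0.35],
    "Life_Expectancy": [80.0, 78.0, 76.0, 74.0],
    "Broadband_Access": [0.80, 0.90, 0.70, 0.85],
})

neg, pos = top_correlates(toy, "Adult_Obesity", k=1)
print(neg)
print(pos)
```

Keeping the values makes it possible to tell a strong relationship from a weak one that merely happens to rank near the top.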

A few more plots

In [42]:
# Set the figure size before plotting so it applies to this figure
sns.set(rc={'figure.figsize':(6,6)})
sns.scatterplot(data=df, x = "Median_Household_Income", y = "Adult_Obesity")
In [43]:
sns.set(rc={'figure.figsize':(6,6)})
sns.scatterplot(data=df, x = "Adult_Smoking", y = "Adult_Obesity")

Extracting state averages¶

In [44]:
state_df = df[ pd.to_numeric(df["County_FIPS_Code"]) == 0]
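
The filter above keeps every row whose county code is 0, which mixes state-level rows with the national row. If the two ever need to be handled separately, one possible approach is sketched below. It assumes that the national row is the one where State_FIPS_Code is also 0; that convention is an assumption about the dataset, and `split_levels` is a hypothetical helper. The `toy` frame only illustrates the layout.

```python
import pandas as pd


def split_levels(frame):
    """Split rows into national, state-level and county-level frames."""
    county_code = pd.to_numeric(frame["County_FIPS_Code"])
    state_code = pd.to_numeric(frame["State_FIPS_Code"])
    # Assumption: the national summary row has both codes equal to 0
    national = frame[(county_code == 0) & (state_code == 0)]
    states = frame[(county_code == 0) & (state_code != 0)]
    counties = frame[county_code != 0]
    return national, states, counties


# Small illustrative frame; in the notebook this would be df.
toy = pd.DataFrame({
    "State_FIPS_Code": [0, 6, 6, 22],
    "County_FIPS_Code": [0, 0, 37, 0],
    "Name": ["United States", "California", "Los Angeles", "Louisiana"],
})

national, states, counties = split_levels(toy)
print(len(national), len(states), len(counties))
```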

Plot the obesity rates for all the states, including the national average¶

In [46]:
# Set the figure size before plotting so it applies to this figure
sns.set(rc={'figure.figsize':(10,9)})
sns.barplot(state_df.sort_values(by = ["Adult_Obesity"]), x="Adult_Obesity", y="State_Abbreviation")
plt.ylabel("State Names")
plt.xlabel("Adult obesity")
plt.yticks(fontsize=8)
plt.title("Obesity rates among adults in different US states", {'fontsize': 20} )

Closing Thoughts and Final Goals¶

I plan to explore why some states or counties have good health outcomes and why others do not. Other questions include: "What factors influence health outcomes the most?", "What affects obesity the most?", "Does the state or county location matter for health outcomes?", "Why are certain demographics correlated with health results?" and so on.

Besides, it would be cool to explore how political preferences affect the health status of a given area. Directly or indirectly, they will have some influence on policies, which further influence the general public's health behaviors.

Similarly, social vulnerability might be associated with health outcomes too, so I plan to explore it further.

Hopefully, I can find more data and variables to merge with this dataset, and with better data analysis, I can figure out which variables to include in a model. The model will be used to predict health outcomes such as mortality or obesity from easily available data.
